High-capacity transmission systems usually include one or more hot 
spares for protection. When a regular transmission channel fails, its 
signal is rapidly transferred to the spare channel under the control of 
protection switching circuits so that there is little signal degradation 
or interruption. This paper studies the reliability of a microproces- 
sor-based terminal protection switching system. Some new and inter- 
esting behavior patterns for transmission systems with automatic 
protection switching are revealed. Also, some new memory self-checking 
algorithms are presented which increase the capability of micropro- 
cessor system fault recognition. 

I. INTRODUCTION 

In high-capacity transmission systems, any failure may affect a large 
number of message circuits. Such systems usually include one or more 
hot spares to increase system reliability. When a regular transmission 
channel fails, its signal is rapidly transferred to the spare channel under 
the control of protection switching circuits so that there is little signal 
degradation or interruption. This paper studies the reliability of a mi- 
croprocessor-based terminal protection switching system (TPSS). The 
specific transmission facility under consideration is the L5E coaxial cable 
analog system, which is an expanded version of the L5 system. 1 The L5E 
multiplex equipment, or multimastergroup translators (MMGT), carry 
up to eight mastergroups, or 4800 telephone circuits. The TPSS will au- 
tomatically switch into service a protection MMGT in the event of a 
failure of any one of up to 20 MMGTs. 

Reliability theory has been studied by numerous authors, 2,3 and al- 
most every Bell System transmission facility with automatic protection 
switching has been the subject of at least one reliability study. 4 - 5 The 
present analysis was undertaken for several reasons. First, many sim- 
plifying assumptions were made in the previous studies. Not all the 
effects of the reliability of the switch, the protection switching control 
circuit, and the monitor circuit failures were taken into account. Second, 
in most cases, exponentially distributed restoration time has been as- 
sumed. This means that the probability of restoration at any instant after 
a failure is assumed to be independent of how much time has already 
been spent on restoring the failure. This assumption is rarely true in 
high-capacity transmission systems. Third, only steady-state analyses 
were made. A system with hidden failures will not reach its steady state 
in its lifetime. Fourth, a microprocessor-based protection switching 
control circuit has not been studied in such detail before. Finally, past 
experience has shown that maintenance-induced service outages 
account for a large share of the total outage time. This study also 
tries to take these outages into consideration. 

With the MMGT system as an example, the present study attempts 
to analyze the same reliability problem in more detail and with less re- 
strictive assumptions. Section II describes the protection switching ar- 
rangement. Section III explains the specific approaches used in this 
paper. Section IV presents the results graphically to emphasize the 
various reliability trends. Section V summarizes the conclusions ob- 
tained. Appendix A investigates some new microprocessor self-checking 
algorithms and Appendix B presents the derivations. 

II. MMGT PROTECTION SWITCHING SYSTEM DESCRIPTION 

Figure 1 is a simplified MMGT-system block diagram which illustrates 
the 1 × n protection switching arrangement. There is one protection 
channel in each direction of transmission. Under the command of the 
microprocessor, each protection channel protects up to n regular chan- 
nels, where n is equal to 20 in the TPSS. The same processor is used to 
control the switching actions of both directions of transmission. The 
switches are all solid-state devices, and their normal states are indicated 
in the figure. The crucial output switches are dual-powered. Parts of the 
output switch are designated the through switch and the substitute 
switch for later reference. 

When there is no alarm from the various regular pilot detectors, the 
processor exercises the input switches for each channel sequentially to 
detect possible protection failures. In the event of a failure of one of the 
regular channels, the corresponding pilot detector sends an alarm to the 
processor. If the protection channel is available, the processor will first 
switch the input signal through the input switches to feed the protection 
channel. Whether the protection detector indicates a good signal or not, 
the processor will complete the 1 × 2 output switch. The regular detector 
is now monitoring the signal supplied by the protection channel via the 
output substitute switch. If the regular detector still alarms after the 
protection switch, the switching action will be reversed. The 1 × 2 output 
switch will be deactivated and the input switch released. If the regular 
detector stops alarming after the output switching, a successful pro- 
tection switch has been made, and the protection detector is monitoring 
the failed regular channel. When the failed channel is repaired, the 
protection detector will see a good signal, and the switches will return 
to their normal states. The protection channel is then free to service 
another regular channel failure. 

Service outages can occur in many ways. In addition to multiple 
transmission failures, they can also be generated by the failures of the 
detectors, the switches, or the microprocessor system. The various failure 
modes are taken into account in later derivations. 

III. APPROACHES 

Two reliability measures of interest in transmission systems are used 
in this study. The first measure is the probability of service outage due 
to equipment failures. This probability translates directly to the system 
outage time per year and is the most commonly used figure of merit in 
determining transmission system reliability. The second measure is the 
probability of having maintenance activities going on. This measure will 
be abbreviated as the probability of activity. It is believed to be closely 
related to the probability of having maintenance-induced outages. This 
probability of activity is greater than the probability of having alarms 
because there are failures that cannot be detected locally. For instance, 
if the pilot detector for a failed regular channel is stuck in the 
no-alarm state, the failure can only be detected by downstream offices. Thus 
there may be maintenance activities in an office but no alarm. The 
probability of activity is less than the probability of having failures be- 
cause there are undetectable failures such as the breakdown of an output 
substitute switch. A reliable system should have a small probability of 
outage and a small probability of activity. 

Two additional criteria are used to measure the effectiveness of the 
overall protection plan. The improvement factor (IF) is defined as the 
ratio of the probability of outage without protection switching to that 
with protection switching. The activity factor (AF) is defined as the ratio 
of the probability of activity with protection switching to that without 
protection switching. These definitions agree with the common notion 
that an effective protection plan should provide more improvement and 
less activity. Thus, a better protection system has a bigger IF and a 
smaller AF. The activity factor is always greater than one. 
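The two factors are simple ratios, and a probability converts to time per year by multiplying by the length of a year. The following is a minimal sketch of those definitions only; the function names and all numerical values are hypothetical and are not taken from the study.

```python
HOURS_PER_YEAR = 8760.0  # non-leap year, used only for unit conversion


def improvement_factor(p_outage_unprotected, p_outage_protected):
    """IF: outage probability without protection over that with protection."""
    return p_outage_unprotected / p_outage_protected


def activity_factor(p_activity_protected, p_activity_unprotected):
    """AF: activity probability with protection over that without protection."""
    return p_activity_protected / p_activity_unprotected


def minutes_per_year(probability):
    """Convert a steady probability of a condition into minutes per year."""
    return probability * HOURS_PER_YEAR * 60.0


# Hypothetical probabilities, chosen only to exercise the definitions.
p_out_unprot, p_out_prot = 4.0e-5, 4.0e-6
p_act_prot, p_act_unprot = 6.0e-5, 2.0e-5

print("IF =", improvement_factor(p_out_unprot, p_out_prot))
print("AF =", activity_factor(p_act_prot, p_act_unprot))
print("outage minutes per year =", minutes_per_year(p_out_prot))
```

With these definitions, an IF below one means that protection switching adds outage, and an AF above one means that it adds maintenance activity.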

The probabilities discussed above are derived under the assumptions 
that the various failures are statistically independent and the failure 
rates are constant. These are very simple assumptions considering the 
complexity of the problem. The assumption of statistical independence 
is made to avoid estimating conditional failures, although there is 
probably dependency between the through switch and the substitute 
switch. The constant failure rate implies exponentially distributed 
failures, i.e., any working item is as good as new. This is a reasonable 
assumption for solid-state devices after the initial "burn-in" period. 
Notice that no distributional assumption is made on the restoration time. 
Based only on the failure rates and the restoration times of the compo- 
nents of the system, the various probabilities are derived from the basic 
definitions of conditional probability. Not only does this approach re- 
quire little mathematical background, but the result is more general and 
more accurate than the usual method of Markoff chain or birth-and- 
death stochastic processes, 2,3 which assume that both failure and res- 
toration times are exponentially distributed. 

IV. DETAILED RESULTS 

Table I introduces the notations and gives the estimated failure rates 
in FITS (number of failures per component per $10^9$ hours), restoration 
times in hours, and the availabilities of the various components. The 
restoration time is the sum of the detection time and the equipment 
replacement time. The mean value of the replacement time t is assumed 
to be 1 hour. Some failure rates are expressed in terms of other failure 
rates to show their relative dependence. This is necessary in later pa- 
rameter sensitivity studies. The failure of a substitute switch can only 
be detected when its use is called for. Thus, its detection time is the mean 
time between transmission failures of its corresponding channel, i.e., 
$1/(\lambda_r + \lambda_t + \lambda)$. The same is true for the detection time of a regular 
detector, except that the assumed probability that a failed detector gives 
a no-alarm indication is 1/4. In both cases, the equipment replacement 
time is ignored since it is small compared with the detection time. 

The detection times of the hidden CPU (central processing unit) and 
EROM (erasable read-only memory) failures should also be similarly 
calculated. However, the failure of the regular channels to be exercised 
sequentially should provide local craftspeople with the indication that 
something is wrong. Therefore, the detection times are assumed to be 
24 hours. The availability 3 of an item is the probability that the item is 
working. It is a function of time with an initial value of one and with a 
steady-state value equal to the mean time to failure divided by the sum 
of the mean time to failure and the mean restoration time. If a compo- 
nent has a short failure detection time, the transient portion in its 
availability value vanishes quickly, and the steady-state theoretical 
availability approximates the actual availability very well. For example, 
the steady-state availability of the regular channel is $p_r = 1/1.000001$. 
The reliability function of the regular channel is $e^{-10^{-6}t}$. It takes only 
1 hour for the reliability function to reach its steady-state availability 
value. 
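The regular-channel figures quoted above can be checked numerically. The sketch below assumes a failure rate of 1000 FITS (the value implied by the exponent $10^{-6}$ per hour) and a mean restoration time of 1 hour (the value implied by $p_r = 1/1.000001$); the function names are illustrative only.

```python
import math


def fits_to_rate_per_hour(fits):
    """One FIT is one failure per 10**9 component hours."""
    return fits * 1e-9


def steady_state_availability(rate_per_hour, mean_restoration_hours):
    """Mean time to failure over (mean time to failure + mean restoration time)."""
    mttf = 1.0 / rate_per_hour
    return mttf / (mttf + mean_restoration_hours)


def reliability(rate_per_hour, t_hours):
    """Probability of no failure up to time t for a constant failure rate."""
    return math.exp(-rate_per_hour * t_hours)


rate = fits_to_rate_per_hour(1000.0)       # assumed: 1000 FITS = 1e-6 per hour
p_r = steady_state_availability(rate, 1.0)  # assumed: 1 hour mean restoration

print("steady-state availability:", p_r)
print("reliability after 1 hour: ", reliability(rate, 1.0))
```

The two printed values agree to within about one part in $10^{12}$, which is the sense in which the transient vanishes within an hour.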
These arguments do not hold for failures requiring long detection 
times. For instance, the mean time to failure and the mean restoration 
time of a substitute switch are in the order of hundreds of years, while 
the life span of the equipment is expected to be only 40 years. To obtain 
an appropriate availability in such cases, one would observe that the 
restoration time of the substitute switch is exponentially distributed. 
This is due to the fact that the replacement time is ignored and the de- 
tection time depends on the transmission failures which are exponen- 
tially distributed. Thus the availability function can be derived explicitly 
as 

$$A_s(t) = \frac{1}{1 + \lambda_s \mu_s} + \frac{\lambda_s}{\lambda_s + \mu_s^{-1}}\, e^{-(\lambda_s + \mu_s^{-1})t}.$$

The availability $p_s$ given in Table I is the $A_s(t)$ averaged over the life 
span $T$ of the equipment. The availability of the detector $p_d$ is obtained 
similarly. The availability expressions of the EROM and the RAM reflect 
the use of 4 EROMs and 2 RAMs in the TPSS. 
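The life-span average can be written in closed form for a two-state item with constant failure rate and exponentially distributed restoration time. The sketch below assumes that model, with `lam` the failure rate per hour and `mu` the mean restoration time in hours, and checks the closed form against a numerical integration. The parameter values are hypothetical and are not the Table I entries.

```python
import math


def availability(t, lam, mu):
    """Two-state availability at time t, starting from a working item."""
    a = lam + 1.0 / mu
    steady = 1.0 / (1.0 + lam * mu)
    return steady + (lam / a) * math.exp(-a * t)


def average_availability(life_span, lam, mu):
    """Availability averaged over [0, life_span], in closed form."""
    a = lam + 1.0 / mu
    steady = 1.0 / (1.0 + lam * mu)
    # (1/T) * integral of exp(-a t) over [0, T] equals (1 - exp(-a T)) / (a T)
    return steady + (lam / a) * (-math.expm1(-a * life_span)) / (a * life_span)


LIFE_SPAN_HOURS = 40 * 8760.0  # 40-year equipment life
lam_s = 100e-9                 # hypothetical failure rate: 100 FITS
mu_s = 1.0e6                   # hypothetical mean restoration time in hours

print("steady-state value:", 1.0 / (1.0 + lam_s * mu_s))
print("life-span average: ", average_availability(LIFE_SPAN_HOURS, lam_s, mu_s))
```

When the restoration time is comparable to or longer than the life span, the average stays well above the steady-state value, which is why the steady-state figure alone would be pessimistic for hidden failures.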

To gain insight and to study the sensitivity of the derived probabilities 
to the estimated failure rates and restoration times, the various estimated 
parameters are varied one at a time to show the system reliability trends. 
The results are presented graphically in the figures. In each figure, the 
solid line corresponds to the ordinate at the left and the dotted line to 
that at the right. 

Figures 2 through 7 present the variations of the outage and the ac- 
tivity probabilities as functions of the regular channel, the detector, the 
switch, the CPU, the EROM, and the RAM failure rates, respectively. Most 
of the curves are almost linear because, for the small failure rates of 
interest, they are still in their linear regions. As far as the probability of 
outage is concerned, undetectable failures are the most damaging. The 
hidden detector and the substitute switch failures contribute to the 
bigger slopes in Figs. 3 and 4. Increasing the microprocessor system 
failure rates adds very little to the outage probability, as can be seen from 
Figs. 5 to 7. The probabilities increase fastest with the switch 
failure rate because there are so many switches in the system. Figure 
8 indicates that service outage can increase substantially if the re- 
placement time for failed equipment is long. Figure 9 shows the effect 
of varying the detection time of the hidden microprocessor failure. 
Neither the outage nor the activity probability is sensitive to the de- 
tection time. Figure 10 shows the effect of varying the number of regular 
channels equipped. The discrete points in the figure are connected to 
show the almost linear trends. When the system is fully loaded, i.e., n 
= 20, there are about 2 minutes of service outage each year due to 
equipment failures and there is about half an hour of maintenance 
activities. It should be emphasized that the curves present the right trends 
rather than numerical accuracy. From Fig. 2, if the failure rate of the 
regular channel is increased by ten times, there will be 4 minutes of 
outage and 4 hours of activity each year. Figure 10 shows the two 
probabilities as functions of the number of regular channels. The discrete 
points are connected to indicate trends. For terminal circuits which 
usually have small failure rates, there is scarcely any need for a second 
protection channel even when the number of regular channels is 
large. 

Fig. 2. Probabilities of outage and activity as functions of regular channel 
failure rate. 
Fig. 3. Probabilities of outage and activity as functions of detector failure rate. 
Fig. 4. Probabilities of outage and activity as functions of through switch 
failure rate. 
Fig. 5. Probabilities of outage and activity as functions of CPU failure rate. 
Fig. 6. Probabilities of outage and activity as functions of EROM failure rate. 
Fig. 7. Probabilities of outage and activity as functions of RAM failure rate. 

A system without protection switching has only the regular channels 
and their corresponding detectors to indicate alarms. The switches and 
the microprocessor devices are not required. Thus there is definitely less 
activity in the maintenance offices. Figure 11 shows the trend that, for 
small regular channel failure rates, the IF can be less than unity, i.e., 
having protection switching actually causes more service outage. This 
is true when the failure rate of the regular channel is small compared with 
those of the protection switching circuits. Furthermore, protection 
switching generates many more activities at low regular channel failure 
rates. Figure 12 amplifies this fact by examining the 1 × 1 configuration. 
The IF is so small and the AF is so big that implementation of a 1 × 1 
protection plan is questionable at low failure rates. Figure 13 gives the 
variations of the two factors with detector failure rates. Since detector 
failures have little effect on the outage probability of an unprotected 
system, the IF decreases with increasing detector failure rate. The interesting 
shape of the AF curve is due to the relatively rapid increase in the 
probability of activity for an unprotected system when the detector failure 
rates are small. This behavior is unique to the variation of the detector 
failure rate because an unprotected system is equipped only with the 
transmission channels and the detectors. 

Fig. 8. Probabilities of outage and activity as functions of equipment 
replacement time. 
Fig. 9. Probabilities of outage and activity as functions of hidden 
microprocessor failure detection time. 
Fig. 10. Probabilities of outage and activity as functions of number of 
regular channels. 

Figure 14 again indicates the important role played by the output 
switch. If its failure rate is high enough, the IF can reduce to less than 
unity. With a perfect switch, the outage of a protected system can be 
hundreds of times less than that of an unprotected system. The curves 
showing the two factors as functions of the CPU, the EROM, and the RAM 
failure rates are not given here. These curves can be simply deduced from 
Figs. 5 to 7 because the various probabilities of an unprotected system 
are independent of microprocessor failures. Similarly, the factors 
involving hidden microprocessor failure restoration time can be obtained 
from Fig. 9. Figure 15 shows that both the IF and the AF are not very 
sensitive to how long it takes to replace failed equipment. Figure 16 varies 
the number of regular channels. It indicates that more than 10 regular 
channels should be used to take advantage of the protection switching 
arrangement. 

Figure 17 exhibits an interesting behavior of general protection 
switching systems. As the failure rate of the regular channel increases, 
the IF increases from less than one to a maximum and then starts to 
decrease. When the failure rate becomes very large, the outage 
probability is close to 1 with or without protection switching. Thus the IF 
approaches 1 eventually. The maximum IF shown in the figure occurs at 
around 150,000 FITS. Although it is unlikely for a terminal multiplexer 
to possess so high a failure rate, a line transmission system with many 
cascading repeaters may very well have a failure rate of this order. 
Therefore, whenever a line protection switching system is planned, the 
reliability should be studied to determine the length of the protection 
span so that the IF does not fall in its decreasing region. Of course, the 
outage probability should also be taken into account to meet any 
prescribed service objectives. 

Fig. 11. Improvement and activity factors as functions of regular channel 
failure rates. 
Fig. 12. Improvement and activity factors as functions of regular channel 
failure rates. 
Fig. 13. Improvement and activity factors as functions of detector failure rates. 
Fig. 14. Improvement and activity factors as functions of through switch 
failure rates. 

V. CONCLUSIONS 

The reliability of the microprocessor-based TPSS has been studied 
in detail using conditional probability. Consideration of the four criteria, 
i.e., the probability of outage, the probability of activity, the improvement 
factor, and the activity factor, should provide an adequate description 
of the effectiveness of the overall protection plan. Several 
conclusions can be drawn from the analysis. First, terminal circuits 
usually have low failure rates so that one protection channel is adequate 
for the protection of many regular channels without having excessive 
probability of service outage. Second, undetectable failures are usually 
the prime causes for increased outage probability and decreased 
improvement factor. If preventive maintenance is ever to be carried out, 
the hidden failures should be the principal targets. Third, the 
microcomputer is reliable as a protection switching controller. Although 
microprocessor system failures can cause false switching all by themselves, 
they contribute only a very small amount of the total outage if adequate 
self-checking is implemented. Reliability could be further improved by 
providing hardware interlock logic to guard against an insane 
microprocessor. For example, a logic circuit can be provided in the TPSS to 
prevent the operation of an output switch whenever its input switch is 
inactive. Fourth, all the figures indicate that, around the various 
estimated failure rates of interest, the outage probabilities increase almost 
linearly with the failure rates. Thus there is no "preferred" range of 
failure rates that any equipment should be designed to. Only the 
sensitivities of the outage probabilities to the various estimates are different. 
Fifth, for any TPSS, the implementation of a 1 × 1 protection plan should 
be studied carefully. Even if there is improvement in the outage 
probability due to equipment failure, the increased activity will generate more 
maintenance-induced outages, not to mention increased costs. 

Fig. 15. Improvement and activity factors as functions of equipment 
replacement time. 
Fig. 16. Improvement and activity factors as functions of number of regular 
channels. 

The above comments do not apply in line protection switching sys- 
tems, which have much higher regular channel failure rates because of 
the cascaded repeaters. Finally, Fig. 17 suggests one more consideration 
in determining the length of a line protection switching span. The failure 
rate of the line should preferably not fall into the decreasing region of 
its improvement factor. The last two points are obvious and interesting 
protection switching behavior patterns which seem not to have been 
explicitly pointed out before. 

Fig. 17. Regular channel failure rates as functions of improvement factor. 

APPENDIX A 

This appendix discusses microprocessor self-test algorithms whose 
purpose is to generate alarms as early as possible to initiate maintenance 
actions. The test should be exhaustive but should not require too much 
additional program memory. An 8-bit microprocessor is used in the TPSS 
application. 

When the power is turned on, the microprocessor immediately per- 
forms a thorough RAM check. Static RAMs are used, so there is no pattern 
sensitivity problem. The checking algorithm is to write the least-significant 
8 address bits of each RAM byte into that specific RAM location. 
After all RAM locations are loaded, the contents of each byte are compared 
with its least-significant 8-bit address. After a byte is checked, its 
contents are complemented and checked again. The complemented 
contents will remain in those bytes already checked. This algorithm is 
able to detect any bit, any data pin, and any combination of address pins 
stuck at zero or one. It can also discover data and address lines shorted 
together. Thus most RAM failures can be detected. 
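The power-on RAM check described above can be sketched against a simulated memory. In the following, `make_ram` is a hypothetical test harness (not part of the TPSS) that lets an address-line or data-line fault be injected, and `power_on_ram_check` follows the write, compare, complement, and compare-again sequence. A stuck line is modeled simply as a bit forced to 0 or 1.

```python
def make_ram(size, addr_fault=lambda a: a, data_fault=lambda d: d):
    """Return (read, write) closures over a simulated RAM with optional faults."""
    cells = [0] * size

    def read(address):
        return data_fault(cells[addr_fault(address)])

    def write(address, value):
        cells[addr_fault(address)] = value & 0xFF

    return read, write


def power_on_ram_check(read, write, size):
    """Return True if the RAM passes the address-pattern test."""
    # Pass 1: load every byte with the low 8 bits of its own address.
    for address in range(size):
        write(address, address & 0xFF)
    # Pass 2: verify, complement in place, verify again; the complement stays.
    for address in range(size):
        expected = address & 0xFF
        if read(address) != expected:
            return False
        complemented = expected ^ 0xFF
        write(address, complemented)
        if read(address) != complemented:
            return False
    return True


SIZE = 1024
print(power_on_ram_check(*make_ram(SIZE), SIZE))
print(power_on_ram_check(*make_ram(SIZE, addr_fault=lambda a: a & ~0x200), SIZE))
```

The second case forces address bit 9 to zero. Addresses 0 and 512 then share a cell and also share the same low address byte, so the fault is caught only because the complemented contents are left behind in bytes already checked.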

The ROMs are checked immediately following the RAM check. Two 
consecutive bytes in each ROM are reserved for self-test. One byte is used 
for parity check and the other for short-circuits in address and data lines. 
The microprocessor reads out every byte in the ROM and performs a 
cumulative odd parity check through an exclusive-OR operation on each 
bit. It will be seen first that, as far as independent ROM bit failures are 
concerned, it is adequate to use only one byte to check the parity of all 
ROMs no matter how many ROMs are used in the system. Let $\ell$ be the 
number of ROM bytes (excluding the reserved checking byte) used in the 
system and $\epsilon$ be the probability of a ROM bit failure. The probability of 
having parity violations is $1 - (1 - p)^8$, where $p$ is given by (Ref. 6) 

$$p = \sum_{k\ \mathrm{odd}} \binom{\ell+1}{k}\, \epsilon^{k} (1-\epsilon)^{\ell+1-k}.$$

The probability of having bit errors is simply $1 - (1 - \epsilon)^{(\ell+1)\times 8}$. For 
$\ell\epsilon \ll 1$, both probabilities can be approximated by $8 \times (\ell + 1) \times \epsilon$. Thus 
the single byte parity check is adequate when $\ell\epsilon \ll 1$. It can be seen below 
that this condition is always valid in practice. Since the experimental 
failure rate of the 1K-byte EROM is 300 FITS, the failure rate of each bit 
cannot be more than 300/(8 × 1024) ≈ 0.037 FIT. If a ROM failure can be 
discovered in 24 hours, then $\epsilon < 10^{-9}$. The number $\ell$ is limited by the 
microprocessor addressing capability which is 64K. Therefore, $\ell\epsilon \ll 1$. 
The reason that one parity byte is used in each ROM is to detect address 
and data lines stuck at one or zero. Since the ROM has a capacity equal 
to a power of 2, a stuck output looks like an even number of ones or zeros 
and violates the odd parity. A stuck address will cause half the bytes to 
be read twice and again violate the odd parity. 
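The stuck-line argument can be illustrated with a simulated ROM. The sketch below assumes a 1K-byte ROM whose last location holds the parity byte; the placement of the byte, the random payload, and the fault models are assumptions made for the illustration only. The parity byte is chosen so that the exclusive-OR of all bytes has a one in every bit position (odd parity per bit column).

```python
import random

ODD_PARITY = 0xFF  # every one of the 8 bit columns must XOR to 1


def build_rom(payload):
    """Append a parity byte so that the XOR of all bytes equals ODD_PARITY."""
    acc = 0
    for byte in payload:
        acc ^= byte
    return list(payload) + [acc ^ ODD_PARITY]


def rom_parity_ok(read, size):
    """Cumulative exclusive-OR over every ROM location."""
    acc = 0
    for address in range(size):
        acc ^= read(address)
    return acc == ODD_PARITY


rng = random.Random(0)  # fixed seed so the illustration is repeatable
rom = build_rom([rng.randrange(256) for _ in range(1023)])
SIZE = len(rom)  # 1024, a power of 2


def healthy(address):
    return rom[address]


def data_line_stuck_high(address):
    return rom[address] | 0x10  # data bit 4 always reads 1


def address_line_stuck_low(address):
    return rom[address & ~0x004]  # address bit 2 always 0


print(rom_parity_ok(healthy, SIZE))
print(rom_parity_ok(data_line_stuck_high, SIZE))
print(rom_parity_ok(address_line_stuck_low, SIZE))
```

A stuck data line contributes the same bit an even number of times, so that column XORs to zero. A stuck address line reads half the bytes twice, so every column XORs to zero.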

The contents of the bits of the other byte used for self-test are alternating 
ones and zeros. When this byte is read, short-circuits in data lines 
are detected. If this byte is located at an address whose 10 least-signifi- 
cant address bits are alternating ones and zeros, reading this byte will 
most likely detect short-circuits among these address lines. The prob- 
ability is very small that within the same ROM another byte which also 
contains alternating ones and zeros is read because of shorted address 
lines. To detect some of the short-circuits in the remaining six most 
significant address lines, complemented numbers are stored in these 
checking bytes according to their address parities. Each ROM can select 
one of two hexadecimal numbers, AA or 55, to store at one of two addresses. 
For the first ROM with 0000 starting address, the two addresses 
are 0155 and 02AA. 
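The choice of AA and 55 as data and of 0155 and 02AA as addresses can be checked directly: each is a pattern of alternating ones and zeros over 8 data bits or 10 address bits. The sketch below also models a short between two adjacent lines as a wired-AND, which is an assumption about the electrical behavior made only for illustration.

```python
def is_alternating(value, width):
    """True if adjacent bits differ across the low `width` bits of value."""
    bits = [(value >> i) & 1 for i in range(width)]
    return all(bits[i] != bits[i + 1] for i in range(width - 1))


def short_adjacent(value, i):
    """Model lines i and i+1 shorted together as a wired-AND (assumed)."""
    both = (value >> i) & (value >> (i + 1)) & 1
    mask = (1 << i) | (1 << (i + 1))
    return (value & ~mask) | (mask if both else 0)


for data in (0xAA, 0x55):
    print(hex(data), is_alternating(data, 8))
for address in (0x0155, 0x02AA):
    print(hex(address), is_alternating(address, 10))

# Any adjacent short disturbs an alternating pattern, so it is visible on read.
print(all(short_adjacent(0xAA, i) != 0xAA for i in range(7)))
```

Because neighboring bits always differ in an alternating pattern, a short between any adjacent pair changes the value that is read or the location that is addressed.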

The two consecutive checking bytes must be preceded by a jump or 
branch instruction to bypass them in normal program execution. It is 
obvious that, if a single parity checking byte is located at an address with 
alternating ones and zeros, it alone can detect all ROM failures mentioned 
above except shorted data lines. It is sometimes possible to make use of 
the opcode and the operand of the jump or branch instruction to check 
the shorted data lines. If any failure occurs in the first ROM where the 
checking program is stored, the failure cannot always be detected. Du- 
plicating the first ROM may be a possible solution. 

After the two memory tests, a few instructions are exercised to test 
the CPU. Then the microprocessor starts executing the main program. 
Under normal circumstances, the program never comes back to the above 
RAM, ROM, and CPU tests. Different checks are performed in the main 
program. To avoid delaying the program execution, only distributed 
checks on the memory system are made. For example, in going through 
a program loop, only one RAM byte is tested and only one ROM exclusive 
OR is taken. However, the ROM check uses the same algorithm discussed 
above. The RAM check uses alternating ones and zeros which detect only 
shorted data lines and stuck bits because the exhaustive RAM check 
discussed before will destroy the temporary data stored, in addition to 
requiring long execution time. After each cycle of the nonexhaustive RAM 
check, an additional test 7 is made. Zeros are stored in the first RAM byte. 
Ones are stored only in RAM bytes with addresses $2^i$, $i = 1, 2, \ldots$. Every 
time all ones are loaded into an address, the contents of the first all-zero 
byte are also checked. The check is also distributed so as not to delay 
normal program execution. Most remaining RAM failures can be dis- 
covered by this additional test. 
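One step of the additional, nondestructive address test might look like the sketch below. It assumes a simulated RAM (`make_ram` is a hypothetical harness, not TPSS code) and saves the two bytes involved in local variables, standing in for CPU registers, so that temporary data survive the check. In a main program only one such step would be taken per pass through the loop.

```python
def make_ram(size, addr_fault=lambda a: a):
    """Return (read, write) closures over a simulated RAM with an address fault."""
    cells = [0] * size

    def read(address):
        return cells[addr_fault(address)]

    def write(address, value):
        cells[addr_fault(address)] = value & 0xFF

    return read, write


def address_step(read, write, i):
    """Test address 2**i against the first byte, then restore both bytes."""
    probe = 1 << i
    saved_first, saved_probe = read(0), read(probe)  # held in "registers"
    write(0, 0x00)
    write(probe, 0xFF)
    ok = read(0) == 0x00 and read(probe) == 0xFF
    write(probe, saved_probe)
    write(0, saved_first)
    return ok


def address_cycle(read, write, address_bits):
    """Run every step i = 1, 2, ... in one go (distributed in practice)."""
    return all(address_step(read, write, i) for i in range(1, address_bits))


read, write = make_ram(256)
write(0, 0x12)
write(8, 0x34)
print(address_cycle(read, write, 8), hex(read(0)), hex(read(8)))

bad_read, bad_write = make_ram(256, addr_fault=lambda a: a & ~0x08)
print(address_cycle(bad_read, bad_write, 8))
```

If an address line is stuck, the write of all ones to address $2^i$ lands on the first byte, and the all-zero check of that byte fails.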

The effectiveness of the two RAM checking algorithms discussed above 
is similar. The first one used when turning on the power requires fewer 
steps and is faster. The second one does not destroy any temporary data 
because every check involves at most two RAM bytes (the first byte and 
the $2^i$th byte) whose contents can be temporarily stored into CPU registers. 

No CPU check is performed in the main program. A restarting sanity 
timer is employed to detect CPU failures. Under normal operation, the 
program retriggers the timer at intervals shorter than the timer 
period. If the timer times out, an alarm is generated and the 
microprocessor system goes through its power-on restart cycle again. The 
restarting sanity timer detects complete CPU failures. It can sometimes 
catch other CPU failures (for example, program counter skipping). It also 
reduces the damage caused by power transients because it 
restarts the system. RAM failures sometimes cause the timer to time out. 
ROM failures have similar effects but are more difficult to detect by 
self-test. Output failures can only be detected by reading back the output 
bits immediately after each output operation. 
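The behavior of a restarting sanity timer can be sketched with a discrete-time simulation. Everything below is hypothetical: the class, the tick-based clock, and the way a hang is injected are illustrative choices, not a description of the TPSS hardware.

```python
class SanityTimer:
    """A retriggerable timer that expires if not retriggered within `period`."""

    def __init__(self, period):
        self.period = period
        self.deadline = period

    def retrigger(self, now):
        self.deadline = now + self.period

    def expired(self, now):
        """Return True on time-out and re-arm, as a restart cycle would."""
        if now >= self.deadline:
            self.deadline = now + self.period
            return True
        return False


def simulate(total_ticks, period, loop_ticks, hang_at=None):
    """Count restarts when the program retriggers once every loop_ticks."""
    timer = SanityTimer(period)
    restarts = 0
    hung = False
    for now in range(total_ticks):
        if hang_at is not None and now == hang_at:
            hung = True  # the program stops retriggering the timer
        if not hung and now % loop_ticks == 0:
            timer.retrigger(now)
        if timer.expired(now):
            restarts += 1
            hung = False  # the power-on restart cycle recovers the program
    return restarts


print(simulate(100, period=10, loop_ticks=4))
print(simulate(100, period=10, loop_ticks=4, hang_at=50))
```

In this model a healthy program that loops faster than the timer period never causes a restart, while a single hang causes exactly one.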
APPENDIX B 

This appendix derives the probabilities of outage and activity with 
and without protection switching. Figure 1 shows the configuration for 
a 1 × n protection switching system in each direction of transmission. 
The microprocessor is responsible for the switching actions of 2n regular 
channels. The unprotected system has only the regular transmission 
channels plus pilot detectors for alarm. 

The events of interest in deriving the outage probabilities are 

$S$: service outage without protection switching. 
$S_p$: service outage with protection switching. 
$G_1$: all regular channels are good. 
$G_2$: both protection channels are good. 
$G_3$: all regular detectors are good. 
$G_4$: all through switches are good. 
$G_5$: all substitute switches are good. 
$G_6$: the microprocessor system is good. 
$G_7$: all output switches are good. 

The events $G_i$ are assumed to be statistically independent. Their 
probabilities are given by 

$$
\begin{aligned}
P\{G_1\} &= p_r^{2n} \\
P\{G_2\} &= p_p^{2} \\
P\{G_3\} &= p_d^{2n} \\
P\{G_4\} &= p_t^{2n} \\
P\{G_5\} &= p_s^{2n} \\
P\{G_6\} &= p_m = p_c\, p_e\, p_a \\
P\{G_7\} &= p_o^{2n},
\end{aligned}
$$

where the notations are defined in Table I. The symbol $q$ with appropriate 
subscripts is defined to be $1 - p$ with the same subscript. Let $\bar{G}_i$ 
be the complement of $G_i$ and $g$ be the joint events of the $G_i$'s with 
subscripts denoting the complemented events. For instance, 

$$g_0 = G_1 G_2 G_3 G_4 G_5 G_6 G_7$$

and 

$$g_{35} = G_1 G_2 \bar{G}_3 G_4 \bar{G}_5 G_6 G_7.$$

If these events represent all the possible failure modes of the system, 
then 

$$
\begin{aligned}
P\{S_p\} ={}& P\{S_p g_0\} + P\{S_p g_1\} + \cdots + P\{S_p g_6\} + P\{S_p g_7\} \\
&+ P\{S_p g_{12}\} + \cdots + P\{S_p g_{234567}\} + P\{S_p g_{1234567}\}. \qquad (1)
\end{aligned}
$$

There are a total of $2^7$ terms in (1). Half the terms involve the event $\bar{G}_7$, 
which generates service outage regardless of the other events. Therefore, 

$$
\begin{aligned}
P\{S_p\} ={}& 1 - p_o^{2n} + P\{S_p g_0\} + \cdots + P\{S_p g_6\} + P\{S_p g_{12}\} \\
&+ \cdots + P\{S_p g_{23456}\} + P\{S_p g_{123456}\}. \qquad (2)
\end{aligned}
$$
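The bookkeeping behind the joint events can be made concrete by enumerating them. The sketch below builds all $2^7$ joint events of seven independent events, each labeled by the tuple of complemented indices, and confirms that the half with $G_7$ complemented carries a total probability of $1 - P\{G_7\}$. The values of $P\{G_i\}$ are hypothetical placeholders, not Table I values.

```python
from itertools import combinations
from math import prod


def joint_event_probabilities(p_good):
    """Map each tuple of complemented indices to the joint-event probability."""
    indices = sorted(p_good)
    table = {}
    for count in range(len(indices) + 1):
        for failed in combinations(indices, count):
            table[failed] = prod(
                (1.0 - p_good[i]) if i in failed else p_good[i]
                for i in indices
            )
    return table


# Hypothetical P{G_i}, i = 1..7, used only to exercise the enumeration.
p_good = {1: 0.99, 2: 0.999, 3: 0.98, 4: 0.995, 5: 0.97, 6: 0.9999, 7: 0.996}

table = joint_event_probabilities(p_good)
with_g7_failed = {g: p for g, p in table.items() if 7 in g}

print(len(table), len(with_g7_failed))
print(sum(table.values()))
print(sum(with_g7_failed.values()), 1.0 - p_good[7])
```

The key `(3, 5)` in `table` corresponds to the joint event in which only $G_3$ and $G_5$ are complemented.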

The $2^6$ unknown terms in (2) are to be evaluated. Since the derivations
of each term are very similar, only the details in obtaining the more
involved $P\{S_p g_{1345}\}$ and $P\{S_p g_{26}\}$ will be given. From the
definition of conditional probability,

$$\begin{aligned}
P\{S_p/g_{1345}\} ={}& P\{S_p/g_{1345}, \text{three or more channel failures}\} \\
&\cdot P\{\text{three or more channel failures}/g_{1345}\} \\
&+ P\{S_p/g_{1345}, \text{two channel failures}\}\, P\{\text{two channel failures}/g_{1345}\} \\
&+ P\{S_p/g_{1345}, \text{one channel failure}\}\, P\{\text{one channel failure}/g_{1345}\}.
\end{aligned} \tag{3}$$

It is obvious that two protection channels cannot protect three failures; 
hence 

$$P\{S_p/g_{1345}, \text{three or more channel failures}\} = 1.$$

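The joint event of three or more regular channel failures and
$\bar{G}_1 G_2 \bar{G}_3 \bar{G}_4 \bar{G}_5 G_6 G_7$ has the conditional
probability

$$\begin{aligned}
&P\{\text{three or more channel failures}/g_{1345}\} \\
&\quad = \frac{\left[1 - p_r^{2n} - 2n p_r^{2n-1} q_r - n(2n-1) p_r^{2n-2} q_r^2\right]
p_p^2 (1 - p_d^{2n})(1 - p_t^{2n})(1 - p_s^{2n})\, p_m p_o^{2n}}{P\{g_{1345}\}}.
\end{aligned} \tag{4}$$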

The second term in (3) will be evaluated next. The various events will 
be abbreviated by their initials after their full names are introduced; e.g., 
tcf represents two channel failures. 

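$$\begin{aligned}
P\{S_p/g_{1345}, \text{tcf}\} ={}& P\{S_p/g_{1345}, \text{tcf}, \text{both failures in the same direction of transmission}\} \\
&\cdot P\{\text{both failures in the same direction of transmission}/g_{1345}, \text{tcf}\} \\
&+ P\{S_p/g_{1345}, \text{tcf}, \text{one failure in each direction}\} \\
&\cdot P\{\text{one failure in each direction}/g_{1345}, \text{tcf}\} \\
={}& 1 \cdot \frac{n(n-1) p_r^{2n-2} q_r^2\, p_p^2 (1 - p_d^{2n})(1 - p_t^{2n})(1 - p_s^{2n})\, p_m p_o^{2n}}{P\{g_{1345}, \text{tcf}\}} \\
&+ P\{S_p/g_{1345}, \text{tcf}, \text{one failure in each direction}\} \cdot P\{\text{ofied}/g_{1345}, \text{tcf}\}.
\end{aligned} \tag{5}$$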

Equation (5) follows because one protection channel cannot protect two 
failures in the same direction of transmission. The second term of (5) 
gives 

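$$\begin{aligned}
P\{S_p/g_{1345}, \text{tcf}, \text{ofied}\} ={}& P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{two associated detectors are not both good}\} \\
&\cdot P\{\text{two associated detectors are not both good}/g_{1345}, \text{tcf}, \text{ofied}\} \\
&+ P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{two associated detectors good}\} \\
&\cdot P\{\text{two associated detectors good}/g_{1345}, \text{tcf}, \text{ofied}\} \\
={}& 1 \cdot \frac{n^2 p_r^{2n-2} q_r^2\, p_p^2 (1 - p_d^{2})(1 - p_t^{2n})(1 - p_s^{2n})\, p_m p_o^{2n}}{P\{g_{1345}, \text{tcf}, \text{ofied}\}} \\
&+ P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}\} \cdot P\{\text{tadg}/g_{1345}, \text{tcf}, \text{ofied}\}.
\end{aligned} \tag{6}$$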

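$$\begin{aligned}
P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}\} ={}& P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{both associated substitute switches good}\} \\
&\cdot P\{\text{both associated substitute switches good}/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}\} \\
&+ 1 \cdot \frac{n^2 p_r^{2n-2} q_r^2\, p_p^2 p_d^2 (1 - p_d^{2n-2})(1 - p_t^{2n})(1 - p_s^{2})\, p_m p_o^{2n}}{P\{g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}\}}.
\end{aligned} \tag{7}$$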

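$$\begin{aligned}
&P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}\} \\
&\quad = P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}, \text{both associated through switches good}\} \\
&\qquad \cdot P\{\text{both associated through switches good}/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}\} \\
&\qquad + P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}, \text{not both through switches good}\} \\
&\qquad \cdot P\{\text{not both through switches good}/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}\} \\
&\quad = 1 \cdot \frac{n^2 p_r^{2n-2} q_r^2\, p_p^2 p_d^2 (1 - p_d^{2n-2})\, p_t^2 (1 - p_t^{2n-2})\, p_s^2 (1 - p_s^{2n-2})\, p_m p_o^{2n}}{P\{g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}\}} \\
&\qquad + P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}, \text{nbtsg}\} \cdot P\{\text{nbtsg}/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}\}.
\end{aligned} \tag{8}$$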

For the first term in (8), it is known that not all through switches are
good because of $\bar{G}_4$. The outage probability is one because, if the
two failed channels have good through switches, at least one of the
remaining through switches must have failed. Finally,

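$$\begin{aligned}
&P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}, \text{nbtsg}\} \\
&\quad = P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}, \text{nbtsg}, \text{no other through switch failure}\} \\
&\qquad \cdot P\{\text{no other through switch failure}/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}, \text{nbtsg}\} \\
&\qquad + P\{S_p/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}, \text{nbtsg}, \text{other through switch failure}\} \\
&\qquad \cdot P\{\text{other through switch failure}/g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}, \text{nbtsg}\} \\
&\quad = 0 + \frac{n^2 p_r^{2n-2} q_r^2\, p_p^2 p_d^2 (1 - p_d^{2n-2})(1 - p_t^{2})(1 - p_t^{2n-2})\, p_s^2 (1 - p_s^{2n-2})\, p_m p_o^{2n}}{P\{g_{1345}, \text{tcf}, \text{ofied}, \text{tadg}, \text{bassg}, \text{nbtsg}\}}.
\end{aligned} \tag{9}$$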

In (9), the first conditional outage probability is zero because all the 
failures are protected by the two protection channels. The above deri- 
vations illustrate one of the basic approaches. Each event and its com- 
plement are assumed until the conditional probability of outage is either 
one or zero. 

The third term in (3) is similarly derived. 

$$\begin{aligned}
P\{S_p/g_{1345}, \text{ocf}\} ={}& P\{S_p/g_{1345}, \text{ocf}, \text{associated detector bad}\} \cdot P\{\text{associated detector bad}/g_{1345}, \text{ocf}\} \\
&+ P\{S_p/g_{1345}, \text{ocf}, \text{associated detector good}\} \cdot P\{\text{associated detector good}/g_{1345}, \text{ocf}\} \\
={}& 1 \cdot \frac{2n p_r^{2n-1} q_r\, p_p^2 q_d (1 - p_t^{2n})(1 - p_s^{2n})\, p_m p_o^{2n}}{P\{g_{1345}, \text{ocf}\}} \\
&+ P\{S_p/g_{1345}, \text{ocf}, \text{adg}\}\, P\{\text{adg}/g_{1345}, \text{ocf}\}.
\end{aligned} \tag{10}$$
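$$\begin{aligned}
P\{S_p/g_{1345}, \text{ocf}, \text{adg}\} ={}& P\{S_p/g_{1345}, \text{ocf}, \text{adg}, \text{associated substitute switch good}\} \\
&\cdot P\{\text{associated substitute switch good}/g_{1345}, \text{ocf}, \text{adg}\} \\
&+ 1 \cdot \frac{2n p_r^{2n-1} q_r\, p_p^2 p_d (1 - p_d^{2n-1})(1 - p_t^{2n})\, q_s\, p_m p_o^{2n}}{P\{g_{1345}, \text{ocf}, \text{adg}\}}.
\end{aligned} \tag{11}$$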
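$$\begin{aligned}
P\{S_p/g_{1345}, \text{ocf}, \text{adg}, \text{assg}\} ={}& P\{S_p/g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{one other through switch bad}\} \\
&\cdot P\{\text{one other through switch bad}/g_{1345}, \text{ocf}, \text{adg}, \text{assg}\} \\
&+ 1 \cdot \frac{2n p_r^{2n-1} q_r\, p_p^2 p_d (1 - p_d^{2n-1}) \left[1 - p_t^{2n-1} - (2n-1) p_t^{2n-2} q_t\right] p_s (1 - p_s^{2n-1})\, p_m p_o^{2n}}{P\{g_{1345}, \text{ocf}, \text{adg}, \text{assg}\}}.
\end{aligned} \tag{12}$$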
Equation (12) indicates that the status of the through switch associated 
with the failed regular channel has no effect on the outage proba- 
bility. 

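$$\begin{aligned}
&P\{S_p/g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}\} \\
&\quad = P\{S_p/g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}, \text{bad through switch in other direction of transmission}\} \\
&\qquad \cdot P\{\text{bad through switch in other direction}/g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}\} \\
&\qquad + 1 \cdot \frac{2n p_r^{2n-1} q_r\, p_p^2 p_d (1 - p_d^{2n-1}) (n-1) p_t^{2n-2} q_t\, p_s (1 - p_s^{2n-1})\, p_m p_o^{2n}}{P\{g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}\}}.
\end{aligned} \tag{13}$$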
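$$\begin{aligned}
&P\{S_p/g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}, \text{btsiod}\} \\
&\quad = P\{S_p/g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}, \text{btsiod}, \text{bad switch has good detector}\} \\
&\qquad \cdot P\{\text{bad switch has good detector}/g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}, \text{btsiod}\} \\
&\qquad + 1 \cdot \frac{2n p_r^{2n-1} q_r\, p_p^2 p_d q_d\, n p_t^{2n-2} q_t\, p_s (1 - p_s^{2n-1})\, p_m p_o^{2n}}{P\{g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}, \text{btsiod}\}}.
\end{aligned} \tag{14}$$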

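$$\begin{aligned}
&P\{S_p/g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}, \text{btsiod}, \text{bshgd}\} \\
&\quad = P\{S_p/g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}, \text{btsiod}, \text{bshgd}, \text{corresponding substitute switch bad}\} \\
&\qquad \cdot P\{\text{corresponding substitute switch bad}/g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}, \text{btsiod}, \text{bshgd}\} \\
&\qquad + 0 \cdot P\{\text{corresponding substitute switch good}/g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}, \text{btsiod}, \text{bshgd}\} \\
&\quad = 1 \cdot \frac{2n p_r^{2n-1} q_r\, p_p^2 p_d^2 (1 - p_d^{2n-2})\, n p_t^{2n-2} q_t\, p_s q_s\, p_m p_o^{2n}}{P\{g_{1345}, \text{ocf}, \text{adg}, \text{assg}, \text{ootsb}, \text{btsiod}, \text{bshgd}\}}.
\end{aligned} \tag{15}$$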
From (3) through (15), 

$$\begin{aligned}
P\{S_p g_{1345}\} = p_p^2 p_m p_o^{2n} \Big\{ & (x + x_3)(1 - p_d^{2n})(1 - p_t^{2n})(1 - p_s^{2n}) \\
&+ x_1 \big[ q_d (1 - p_t^{2n})(1 - p_s^{2n}) + p_d (1 - p_d^{2n-1})(1 - p_t^{2n})\, q_s \\
&\quad + p_d (1 - p_d^{2n-1}) \left[1 - p_t^{2n-1} - (2n-1) p_t^{2n-2} q_t\right] p_s (1 - p_s^{2n-1}) \\
&\quad + p_d (1 - p_d^{2n-1}) (n-1) p_t^{2n-2} q_t\, p_s (1 - p_s^{2n-1}) \\
&\quad + p_d q_d\, n p_t^{2n-2} q_t\, p_s (1 - p_s^{2n-1}) + p_d^2 (1 - p_d^{2n-2})\, n p_t^{2n-2} q_t\, p_s q_s \big] \\
&+ x_4 \big[ (1 - p_d^{2})(1 - p_t^{2n})(1 - p_s^{2n}) + p_d^2 (1 - p_d^{2n-2})(1 - p_t^{2n})(1 - p_s^{2}) \\
&\quad + p_d^2 (1 - p_d^{2n-2})\, p_t^2 (1 - p_t^{2n-2})\, p_s^2 (1 - p_s^{2n-2}) \\
&\quad + p_d^2 (1 - p_d^{2n-2})(1 - p_t^{2})(1 - p_t^{2n-2})\, p_s^2 (1 - p_s^{2n-2}) \big] \Big\},
\end{aligned} \tag{16}$$

where

$$x_1 = 2n p_r^{2n-1} q_r$$

$$x_2 = 1 - p_r^{2n} - 2n p_r^{2n-1} q_r$$

$$x_3 = 1 - p_r^{2n} - 2n p_r^{2n-1} q_r - n(2n-1) p_r^{2n-2} q_r^2$$

$$x_4 = n^2 p_r^{2n-2} q_r^2$$

$$x = n(n-1) p_r^{2n-2} q_r^2.$$
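As a sanity check on the channel-failure terms, the following Python sketch assumes they are the usual binomial probabilities for the number of failed channels among 2n: the two "two channel failures" terms (same direction, one in each direction) should add up to $\binom{2n}{2} p_r^{2n-2} q_r^2$, and the four cases (none, one, two, three or more) should partition unity. The value of `p_r` is hypothetical and the function name is introduced here only for illustration.

```python
from math import comb


def channel_failure_terms(n, p_r):
    """Return (x1, x4, x, x3) as defined after (16) for 2n regular channels."""
    q_r = 1.0 - p_r
    m = 2 * n
    x1 = m * p_r ** (m - 1) * q_r                  # exactly one failure
    x4 = n * n * p_r ** (m - 2) * q_r ** 2         # two failures, one per direction
    x = n * (n - 1) * p_r ** (m - 2) * q_r ** 2    # two failures, same direction
    x3 = 1.0 - p_r ** m - x1 - n * (m - 1) * p_r ** (m - 2) * q_r ** 2  # three or more
    return x1, x4, x, x3


n, p_r = 10, 0.999  # hypothetical channel availability
x1, x4, x, x3 = channel_failure_terms(n, p_r)
two_failures = comb(2 * n, 2) * p_r ** (2 * n - 2) * (1.0 - p_r) ** 2
```

The identity $n^2 + n(n-1) = n(2n-1) = \binom{2n}{2}$ is what ties $x_4$ and $x$ to the last term of $x_3$.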

To evaluate $P\{S_p g_{26}\}$, the events

$H_1$: CPU is good
$H_2$: ROMs are good
$H_3$: RAMs are good

will be considered separately. Let $h$ represent joint events similar to
those for $g$; for example, $h_2 = H_1 \bar{H}_2 H_3$. As before,

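$$\begin{aligned}
P\{S_p/g_{26}\} ={}& P\{S_p/g_{26}, \text{both protection channels bad}\}\, P\{\text{both protection channels bad}/g_{26}\} \\
&+ P\{S_p/g_{26}, \text{one protection channel bad}\}\, P\{\text{one protection channel bad}/g_{26}\}
\end{aligned}$$

$$\begin{aligned}
P\{S_p/g_{26}, \text{bpcb}\} ={}& P\{S_p/g_{26}, \text{bpcb}, h_1\}\, P\{h_1/g_{26}, \text{bpcb}\} + P\{S_p/g_{26}, \text{bpcb}, h_2\}\, P\{h_2/g_{26}, \text{bpcb}\} \\
&+ P\{S_p/g_{26}, \text{bpcb}, h_3\}\, P\{h_3/g_{26}, \text{bpcb}\} + P\{S_p/g_{26}, \text{bpcb}, h_{12}\}\, P\{h_{12}/g_{26}, \text{bpcb}\} \\
&+ P\{S_p/g_{26}, \text{bpcb}, h_{13}\}\, P\{h_{13}/g_{26}, \text{bpcb}\} + P\{S_p/g_{26}, \text{bpcb}, h_{23}\}\, P\{h_{23}/g_{26}, \text{bpcb}\} \\
&+ P\{S_p/g_{26}, \text{bpcb}, h_{123}\}\, P\{h_{123}/g_{26}, \text{bpcb}\}.
\end{aligned} \tag{17}$$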

The microprocessor operation is so complicated that simplifying as- 
sumptions have to be made before (17) can be further evaluated. There 
are two kinds of CPU failures. The first kind is a partial failure which may 
not be detected by the self-checking method discussed in Appendix A. 
For instance, program counter skipping or a single transistor failure
within the CPU may not always be detectable. This partial failure may
generate false switching and result in service outage. The second kind
is a complete failure, and the CPU operation stops altogether. No false
switching will be made in this case, and the sanity timer will detect the
failure immediately. It is assumed that partial failures account for 20
percent of the total CPU failures.

When the CPU is partially failed, it executes the contents of the ROMs
insanely. Every "instruction" has a finite probability of generating a false
switching. The TPSS software contains approximately 4000 bytes, of
which 100 can be I/O instructions. Out of the 2n + 5 hardware addresses,
2n have outputs controlling the switches. If a correct parity bit and an
appropriate output switch control bit are stored in the accumulator, an
I/O instruction will operate the output switch. If the protection channels
are bad, the operation of the output switch will generate service outage
regardless of the status of the input switch. Thus the probability $p_1$ that
any instruction will cause an outage is approximately

$$p_1 = \frac{100}{4000} \cdot \frac{1}{4} \cdot \frac{2n}{2n+5}.$$
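To make the size of $p_1$ concrete, the sketch below evaluates the product $(100/4000)\cdot(1/4)\cdot 2n/(2n+5)$ exactly. It assumes n = 10 as an illustrative value (2n = 20 regular channels), and it reads the factor 1/4 as two independent bits (parity and switch control), each correct with probability 1/2; that reading is an interpretation of the text rather than something it states outright. The function name is hypothetical.

```python
from fractions import Fraction


def false_switch_probability(n, state_factor=Fraction(1, 4)):
    """Per-instruction outage probability for an insanely executing CPU.

    state_factor: probability that the accumulator bits happen to be right
    (assumed 1/2 for the parity bit times 1/2 for the switch control bit).
    """
    io_fraction = Fraction(100, 4000)                 # I/O instructions among ROM bytes
    switch_addresses = Fraction(2 * n, 2 * n + 5)     # addresses that drive switches
    return io_fraction * state_factor * switch_addresses


p1 = false_switch_probability(10)   # Fraction(1, 200), i.e. 0.005
```

Under these assumptions roughly one "instruction" in 200 would operate an output switch.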

When the protection channels are working, the same probability is
now

$$p_2 = \frac{100}{4000} \cdot \frac{1}{8} \cdot \frac{2n}{2n+5}$$

because the input switch should be inactive for the false output switching
to generate service outage. It is to be noted that false switching can also
occur randomly if the 8-bit "instruction," the 16-bit "address," the parity
bit, and the switch control bit happen to match the real instruction and
address. This probability is of the order $2n/2^{26}$ and is negligible
compared with $p_1$ and $p_2$. On the average, each instruction takes about
4 microseconds. Thus before restoration, about

$$n_1 = \frac{\mu_c \times 60 \times 60 \times 10^6}{4}$$

"instructions" are executed. The probability $p_3$ that an outage will occur
is

$$\begin{aligned}
p_3 &= p_1 + q_1 p_1 + \cdots + q_1^{n_1 - 1} p_1 \\
&= p_1 \frac{1 - q_1^{n_1}}{1 - q_1} \\
&= 1 - q_1^{n_1}.
\end{aligned}$$

When the protection channels are good, the corresponding probability
is

$$p_4 = 1 - q_2^{n_1}.$$
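The geometric-series step can be checked numerically. The sketch below compares the term-by-term sum with the closed form 1 - (1 - p)^N for a small N, then evaluates the closed form for a large instruction count. The one-hour restoration time and the per-instruction probabilities 0.005 and 0.0025 (the n = 10 values of the two expressions above) are illustrative assumptions, and the function names are hypothetical.

```python
from math import expm1, log1p


def outage_probability(p, count):
    """1 - (1 - p)**count, computed stably for small p and large count."""
    return -expm1(count * log1p(-p))


def geometric_partial_sum(p, count):
    """p + q*p + ... + q**(count - 1) * p with q = 1 - p, summed directly."""
    q = 1.0 - p
    return sum(p * q ** k for k in range(count))


p1, p2 = 0.005, 0.0025        # assumed per-instruction outage probabilities
small_count = 1000            # small enough to sum term by term

restoration_hours = 1.0       # hypothetical restoration time
instructions = int(restoration_hours * 60 * 60 * 10 ** 6 / 4)  # 4 microseconds each

p3 = outage_probability(p1, instructions)
p4 = outage_probability(p2, instructions)
```

With hundreds of millions of "instructions" executed before restoration, both probabilities are numerically indistinguishable from one under these assumptions, which is why a partial CPU failure is treated as almost certain to cause false switching.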

After a false switching, it is possible that the insane CPU may deactivate
the switch and restore service. It may also operate other output switches
to generate additional service outages. These two conditional
probabilities are small. If they are ignored, the outage probability assuming
partial CPU failure and bad protection channels is then $p_3 t/\mu_c$. If only
one of the two protection channels is bad, let

$$p_5 = \frac{100}{4000} \cdot \frac{1}{8} \cdot \frac{n}{2n+5} + \frac{100}{4000} \cdot \frac{1}{4} \cdot \frac{n}{2n+5}.$$

The outage probability is $p_6 t/\mu_c$, where

$$p_6 = 1 - q_5^{n_1}.$$
When a memory failure occurs, the program counter jumps to an ar- 
bitrary location. The initial effect is somewhat like that of a partially 
failed CPU. Experiments indicate that outage is unlikely to occur if it has 
not occurred during the initial period. Since 25 out of the 4000 bytes are 
used to activate the output switches in normal program operation, a jump 
to these bytes will cause a false switching. Therefore, the false switching 
probability is 

25 

Pi = >" Pi 

H 4000 
or 

25 

depending on whether the protection channels are bad or good. If only 
one of the two protection channels is bad, the probability is 

25 
P9 4000 y 
It will be assumed that all RAM failures can be detected. Most of the RAM 
bytes are used for stack. The effects of the ROM and the RAM failures 
are assumed to be identical, but their restoration times are different 
because not all ROM failures are self-detectable. When the CPU fails, 
memory failures are assumed to have no effect on the system. This makes 
the evaluation of the fourth, the fifth, and the last terms in (17) unnec- 
essary once the first term is evaluated. It is further assumed that when 
there are both ROM and RAM failures, the trouble can be detected im- 
mediately. Given the previous assumption, then 

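$$\begin{aligned}
P\{S_p/g_{26}, \text{bpcb}, h_1\} ={}& P\{S_p/g_{26}, \text{bpcb}, h_1, \text{complete failure}\}\, P\{\text{complete failure}/g_{26}, \text{bpcb}, h_1\} \\
&+ P\{S_p/g_{26}, \text{bpcb}, h_1, \text{partial failure}\}\, P\{\text{partial failure}/g_{26}, \text{bpcb}, h_1\} \\
={}& 0 + \frac{p_3 t}{\mu_c} \cdot \frac{p_{10}\, q_p^2\, 0.2\, q_c p_e p_a}{P\{g_{26}, \text{bpcb}, h_1\}},
\end{aligned}$$

where

$$p_{10} = (p_r p_d p_t p_s p_o)^{2n}. \tag{18}$$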
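$$P\{S_p/g_{26}, \text{bpcb}, h_2\} = \frac{p_7 t}{\mu_e} \cdot \frac{p_{10}\, q_p^2\, p_c q_e p_a}{P\{g_{26}, \text{bpcb}, h_2\}}$$

$$P\{S_p/g_{26}, \text{bpcb}, h_3\} = p_7 \frac{p_{10}\, q_p^2\, p_c p_e q_a}{P\{g_{26}, \text{bpcb}, h_3\}}$$

$$P\{S_p/g_{26}, \text{bpcb}, h_{23}\} = p_7 \frac{p_{10}\, q_p^2\, p_c q_e q_a}{P\{g_{26}, \text{bpcb}, h_{23}\}}.$$

Hence

$$P\{S_p, g_{26}, \text{bpcb}\} = p_{10}\, q_p^2 \left[ \frac{p_3 t}{\mu_c}\, 0.2\, q_c + \frac{p_7 t}{\mu_e}\, p_c q_e p_a + p_7 p_c q_a \right]. \tag{19}$$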

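The expression $P\{S_p, g_{26}, \text{opcb}\}$ can be similarly evaluated. Finally,

$$\begin{aligned}
P\{S_p g_{26}\} = p_{10} \bigg\{ & 2 p_p q_p \left[ \frac{p_6 t}{\mu_c}\, 0.2\, q_c + \frac{p_9 t}{\mu_e}\, p_c q_e p_a + p_9 p_c q_a \right] \\
&+ q_p^2 \left[ \frac{p_3 t}{\mu_c}\, 0.2\, q_c + \frac{p_7 t}{\mu_e}\, p_c q_e p_a + p_7 p_c q_a \right] \bigg\}.
\end{aligned} \tag{20}$$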

After deriving (16) and (20), the remaining terms in (2) are easy to
obtain. They will not be given here. Thus the outage probability with
protection switching $P\{S_p\}$ is obtained from (2). It should be emphasized
that, because there are hidden failures, multiple equipment failures
cannot be neglected in evaluating the various terms in (2). In fact, the
term that contributes the most to the outage probability is $P\{S_p g_{135}\}$,
which involves both of the undetectable failures (detector and substitute
switch).

Since the detectors used to generate alarms do not affect signal 
transmission, the outage probability without protection switching is 
simply 

$$P\{S\} = 1 - p_r^{2n}. \tag{21}$$

The improvement factor is

$$\frac{P\{S\}}{P\{S_p\}}. \tag{22}$$

Next, the probabilities of activity with and without protection 
switching will be considered. The additional events of interest are 

$A$: activity without protection switching
$A_p$: activity with protection switching
$G_5$: protection detectors are good.

$G_5$ is redefined because protection detector failures generate
maintenance activities, but the hidden substitute switch failures are assumed
to cause no activity. To calculate the probability of activity with
protection switching, notice that whenever $\bar{G}_1$, $\bar{G}_4$, or
$\bar{G}_7$ occurs, there will definitely be maintenance activity.
Furthermore, the events $\bar{G}_2$ and $\bar{G}_5$ are detectable when
$G_6$ is true. Therefore
$$\begin{aligned}
P\{A_p\} ={}& 1 - (p_r p_t p_o)^{2n} + (p_r p_t p_o)^{2n}\, p_m (1 - p_p^2 p_D^2) + P\{A_p g_0\} \\
&+ P\{A_p g_3\} + P\{A_p g_6\} + P\{A_p g_{26}\} + P\{A_p g_{36}\} + P\{A_p g_{56}\} \\
&+ P\{A_p g_{236}\} + P\{A_p g_{256}\} + P\{A_p g_{356}\} + P\{A_p g_{2356}\}.
\end{aligned} \tag{23}$$

In (23), $P\{A_p g_0\}$ is always zero. The last seven terms are negligible
compared with $P\{A_p g_3\}$ and $P\{A_p g_6\}$. It is assumed that 10 percent
of the CPU and the ROM failures will not generate alarm. The derivation of
$P\{A_p g_6\}$ is similar to that of (17). For example,

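$$\begin{aligned}
P\{A_p/g_6, h_1\} ={}& P\{A_p/g_6, h_1, \text{undetectable failure}\}\, P\{\text{undetectable failure}/g_6, h_1\} \\
&+ P\{A_p/g_6, h_1, \text{detectable failure}\}\, P\{\text{detectable failure}/g_6, h_1\} \\
={}& 0 + \frac{t}{\mu_c} \cdot \frac{(p_r p_d p_t p_o)^{2n} (p_p p_D)^2 \cdot 0.9 \cdot q_c p_e p_a}{P\{g_6, h_1\}}.
\end{aligned}$$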

Thus, 

$$P\{A_p g_6\} = (p_r p_d p_t p_o)^{2n} (p_p p_D)^2 \left[ 0.9\, \frac{t}{\mu_c}\, q_c + 0.9\, \frac{t}{\mu_e}\, p_c q_e p_a + p_c q_a \right]. \tag{24}$$

If it is assumed that, when a detector fails, the probability that it is
stuck in the ON state is 0.25, then

$$\begin{aligned}
P\{A_p/g_3\} ={}& P\{A_p/g_3, \text{one detector bad}\}\, P\{\text{one detector bad}/g_3\} + \cdots \\
&+ P\{A_p/g_3, 2n \text{ detectors bad}\}\, P\{2n \text{ detectors bad}/g_3\}.
\end{aligned} \tag{25}$$

The ith term in (25) is 

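$$\begin{aligned}
P\{A_p/g_3, \text{idb}\} ={}& P\{A_p/g_3, \text{idb}, \text{all bad detectors on}\}\, P\{\text{all bad detectors on}/g_3, \text{idb}\} \\
&+ P\{A_p/g_3, \text{idb}, \text{some bad detectors off}\} \cdot P\{\text{some bad detectors off}/g_3, \text{idb}\} \\
={}& \frac{t}{\mu_d} \cdot \frac{(p_r p_t p_o)^{2n} (p_p p_D)^2\, p_m \binom{2n}{i} p_d^{2n-i} q_d^{i} (1 - 0.25^{i})}{P\{g_3, \text{idb}\}}.
\end{aligned}$$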

Therefore, 
$$P\{A_p g_3\} = \frac{t}{\mu_d}\, p_m (p_p p_D)^2 (p_r p_t p_o)^{2n} \sum_{i=1}^{2n} \binom{2n}{i} p_d^{2n-i} q_d^{i} (1 - 0.25^{i}). \tag{26}$$
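If the sum in (26) runs over the number i of failed detectors with binomial weights $\binom{2n}{i} p_d^{2n-i} q_d^{i}$, as assumed here, it has a simple closed form: by the binomial theorem it equals $1 - (p_d + 0.25\, q_d)^{2n}$, since the i = 0 term vanishes. The Python sketch below checks this numerically for a hypothetical detector availability; the function names are introduced only for illustration.

```python
from math import comb


def detector_activity_sum(n, p_d, stuck_on=0.25):
    """Direct evaluation of sum_{i=1}^{2n} C(2n,i) p^(2n-i) q^i (1 - stuck_on^i)."""
    q_d = 1.0 - p_d
    m = 2 * n
    return sum(
        comb(m, i) * p_d ** (m - i) * q_d ** i * (1.0 - stuck_on ** i)
        for i in range(1, m + 1)
    )


def detector_activity_closed_form(n, p_d, stuck_on=0.25):
    """Same quantity via the binomial theorem: 1 - (p + stuck_on * q)^(2n)."""
    q_d = 1.0 - p_d
    return 1.0 - (p_d + stuck_on * q_d) ** (2 * n)


n, p_d = 10, 0.9995  # hypothetical detector availability
direct = detector_activity_sum(n, p_d)
closed = detector_activity_closed_form(n, p_d)
```

The closed form avoids summing 2n binomial terms and makes the dependence on the stuck-ON probability explicit.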

Equations (23) through (26) yield the probability of activity with
protection switching $P\{A_p\}$. The probability of activity without
protection switching $P\{A\}$ is simply

$$P\{A\} = 1 - p_r^{2n} + \frac{t}{\mu_b}\, p_r^{2n} \sum_{i=1}^{2n} \binom{2n}{i} p_b^{2n-i} q_b^{i} (1 - 0.25^{i}),$$

where

$$p_b = \frac{1}{1 + \lambda_d \mu_b}$$

and 

1 

is the detector restoration time without protection switching. The ac- 
tivity factor is given by 
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